STA2023 Review:
Point Estimation

Introduction: Topics

  • Basic descriptives:
    • Continuous variables:
      • Mean
      • Median
      • Variance and standard deviation
      • Range and interquartile range
    • Categorical variables:
      • Count
      • Overall percentage
      • Row percentage
      • Column percentage

Introduction: Data

  • We will be using data from the kingdom of Equestria (yes, from My Little Pony).

  • Mane Six:

    • Twilight Sparkle (Unicorn \to Alicorn)
    • Applejack (Earth Pony)
    • Fluttershy (Pegasus)
    • Pinkie Pie (Earth Pony)
    • Rainbow Dash (Pegasus)
    • Rarity (Unicorn)

Introduction: Data

  • Name: the pony’s name
  • Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
  • Sex: sex/age of pony (Colt, Filly, Stallion, Mare)
  • Flying speed: average flying speed (km/hr) for winged ponies
  • Friendship: a harmony index from friendship activities (0-10)
  • Magical energy: measured magical energy output (sparkles) for magical ponies
  • Tail shimmer: how much light is reflected by the pony’s tail (lux)

Types of Variables: Qualitative

  • A qualitative or categorical variable classifies an observation into one of two or more groups or categories.
    • Nominal: purely qualitative and unordered
    • Ordinal: data can be ranked, but intervals between ranks may not be equivalent
  • Examples:
    • satisfaction rating
    • favorite color
    • type of pet
    • education level
    • blood type

Types of Variables: Quantitative

  • A quantitative or continuous variable takes numerical values for which arithmetic operations such as adding and averaging make sense; typically has a unit of measure.
    • Interval: meaningful differences between values, but no true zero point
    • Ratio: meaningful differences and a true zero point
  • Examples:
    • age (years)
    • temperature (Celsius)
    • daily hours of sleep
    • ACT or SAT score
    • height (inches)

Types of Variables: Example

  • Name: the pony’s name
  • Type: type of pony (Earth, Pegasus, Unicorn, Alicorn)
  • Sex: sex/age of pony (Colt, Filly, Stallion, Mare)
  • Flying speed: average flying speed (km/hr) for winged ponies
  • Friendship: a harmony index from friendship activities (0-10)
  • Magical energy: measured magical energy output (sparkles) for magical ponies
  • Tail shimmer: how much light is reflected by the pony’s tail (lux)

Describing Data: Why?

  • Why do we describe data? We want to tell a story!
    • Summarize n observations into a single description
    • Understand what is in the data
    • Spot patterns, missing data, or outliers
    • Compare groups or spot differences or oddities

Describing Data: How?

  • How do we describe data?
    • Numbers
      • Frequency table
      • Mean & standard deviation
      • Median & IQR
    • Graphs
      • Bar charts
      • Box plots
      • Histograms

Point Estimation: Mean

  • Mean: the average of a set of values

\bar{y} = \frac{\sum_{i=1}^n y_i}{n}

  • Find the mean for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}

\bar{y} = \frac{\sum_{i=1}^n y_i}{n} = \frac{10 + 20 + 30 + 40 + 100}{5} = 40

  • The average flying speed for winged ponies is 40 km/hr.
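
  • A minimal R sketch of the same calculation, assuming the five speeds are stored in a vector we call speeds (our own name, not a variable from the course data):

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# Mean "by hand": sum of the values divided by the number of values
y_bar <- sum(speeds) / length(speeds)
y_bar

# Same result from the built-in function
mean(speeds)
```

  • Both lines return 40, matching the hand calculation.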

Point Estimation: Median

  • Median: The middle value in an ordered dataset.
    • When we have an even number of observations, we average the two middle values.
  • Find the median for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
    • First, we sort the data: {10, 20, 30, 40, 100}
    • Then, find the middle number: 30
  • The median flying speed for winged ponies is 30 km/hr.
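
  • A short R sketch of the same steps. The four-value vector speeds_even is made up here (the example with the largest value dropped) purely to illustrate the even-n rule:

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# Step 1: sort the data; Step 2: take the middle value
sort(speeds)
median(speeds)       # odd n: the middle value, 30

# Illustration only: with an even number of values,
# median() averages the two middle values
speeds_even <- c(10, 20, 30, 40)
median(speeds_even)  # (20 + 30) / 2 = 25
```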

Point Estimation: Variance

  • Variance: A measure of spread; the average of squared differences from the mean.
    • Higher variance = data has more spread.
    • In squared units of the data.

s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1}

  • Find the variance for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}

s_y^2 = \frac{\sum_iy_i^2 - (\sum_iy_i)^2/n}{n-1} = \frac{(10^2+...+100^2)-(10+...+100)^2/5}{4} = 1250

  • The variance is 1250 (km/hr)^2.
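
  • A sketch in R that checks the shortcut formula against the "squared differences from the mean" definition and against var(). The object names (speeds, s2_shortcut, s2_definition) are our own:

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)
n <- length(speeds)

# Shortcut formula from the slide
s2_shortcut <- (sum(speeds^2) - sum(speeds)^2 / n) / (n - 1)

# Definition: squared differences from the mean, divided by n - 1
s2_definition <- sum((speeds - mean(speeds))^2) / (n - 1)

s2_shortcut
s2_definition
var(speeds)  # built-in sample variance (also divides by n - 1)
```

  • All three give 1250.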

Point Estimation: Standard Deviation

  • Standard Deviation: A measure of spread; roughly the typical distance of an observation from the mean.
    • Higher standard deviation = data has more spread.
    • Same units as the data.

s_y = \sqrt{s_y^2}

  • Find the standard deviation for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}

s_y = \sqrt{s^2_y} = \sqrt{1250} \approx 35.36

  • The standard deviation is 35.36 km/hr.
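
  • In R, the standard deviation can be sketched either as the square root of the variance or with sd() directly (the vector name speeds is our own):

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# Square root of the sample variance
s_y <- sqrt(var(speeds))
round(s_y, 2)         # 35.36

# Built-in function gives the same value
round(sd(speeds), 2)  # 35.36
```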

Point Estimation: Range

  • Range: difference between the maximum and minimum values

\text{range} = \text{max}(y) - \text{min}(y)

  • Find the range for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}

\begin{align*} \text{range} = \text{max}(y) - \text{min}(y) = 100 - 10 = 90 \end{align*}

  • The range of the flying speeds is 90 km/hr.
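
  • One thing to watch for in R: range() returns the minimum and maximum rather than their difference, so we subtract ourselves. A minimal sketch (vector name speeds is our own):

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

range(speeds)              # the minimum and maximum: 10 100
max(speeds) - min(speeds)  # the range as a single number: 90
diff(range(speeds))        # same thing: 90
```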

Point Estimation: Interquartile Range

  • Interquartile Range (IQR): range of the middle 50% of the data.

\text{IQR} = \text{P}_{75} - \text{P}_{25}

  • Find the IQR for the flying speeds (km/hr) of 5 ponies: {10, 20, 30, 40, 100}
    • Recall that the median is 30.
    • We then find P_{25} using {10, 20} and P_{75} using {40, 100}
    • Thus, P_{25} = 15 and P_{75} = 70.

\begin{align*} \text{IQR} = \text{P}_{75} - \text{P}_{25} = 70 - 15 = 55 \end{align*}

  • The IQR of the flying speeds is 55 km/hr.
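
  • Software can give a different IQR than the hand calculation because there are several conventions for percentiles. The sketch below (vector name speeds is our own) shows that, for this sample, R's type = 6 convention matches the hand calculation, while R's default (type = 7) does not:

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# type = 6 reproduces the hand calculation for this sample
q_hand <- quantile(speeds, probs = c(0.25, 0.75), type = 6)
q_hand                 # 15 and 70
IQR(speeds, type = 6)  # 55

# R's default convention (type = 7) gives different quartiles here
q_default <- quantile(speeds, probs = c(0.25, 0.75))
q_default              # 20 and 40
IQR(speeds)            # 20
```

  • Neither answer is wrong; when reporting an IQR, it helps to know which convention the software used.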

Point Estimation: Proportion

  • Proportion: a type of mean for categorical data
    • Often expressed as a percentage
    • Useful for categorical responses

\hat{p} = \frac{\sum_{i=1}^n y_i}{n}

  • Note that in this case, y_i indicates whether observation i falls in the category of interest:

y_i = \begin{cases} 1 & \text{if observation } i \text{ is in the category} \\ 0 & \text{otherwise} \end{cases}

Point Estimation: Proportion

  • Find the proportion of ponies that have wings in the following sample: {Y, N, Y, Y, N, Y}

  • Count the number of “Y” responses and divide by total:

\hat{p} = \frac{\sum_{i=1}^n y_i}{n} = \frac{4}{6} \approx 0.667

  • The proportion of ponies with wings is 0.667 (or 66.7%).
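
  • A sketch in R showing why a proportion is "a type of mean": once the responses are recoded as 0/1, the proportion is just the mean of the indicator. The names wings and y are our own:

```r
# Wing status for the 6 ponies in the example
wings <- c("Y", "N", "Y", "Y", "N", "Y")

# Recode to the 0/1 indicator: 1 if the pony has wings, 0 otherwise
y <- as.numeric(wings == "Y")

# Proportion = sum of the indicator divided by n
p_hat <- sum(y) / length(y)
round(p_hat, 3)     # 0.667

# Shortcut: the mean of a TRUE/FALSE vector is the proportion of TRUEs
mean(wings == "Y")
```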

Point Estimation: Frequency Table

  • Frequency table: A table showing how often each value appears in a dataset.
    • Useful for categorical responses.
    • For each category, i, we report n_i (\%_i)
  • Find the frequency table for the following sample of 8 ponies: {Earth, Pegasus, Unicorn, Earth, Pegasus, Pegasus, Unicorn, Alicorn}
  • Frequencies:
    • Alicorn: n_{\text{A}} = 1
    • Earth: n_{\text{E}} = 2
    • Pegasus: n_{\text{P}} = 3
    • Unicorn: n_{\text{U}} = 2
  • Proportions:
    • Alicorn: \hat{p}_{\text{A}} = 1/8 = 0.125
    • Earth: \hat{p}_{\text{E}} = 2/8 = 0.250
    • Pegasus: \hat{p}_{\text{P}} = 3/8 = 0.375
    • Unicorn: \hat{p}_{\text{U}} = 2/8 = 0.250
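
  • A minimal R sketch of the counts and proportions, using base R's table() and prop.table(). The vector name pony_type is our own choice:

```r
# Pony types for the 8 ponies in the example
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

# Frequencies: how many ponies fall in each category
# (table() lists the categories in alphabetical order)
counts <- table(pony_type)
counts

# Proportions: each count divided by the total number of ponies
props <- prop.table(counts)
props
```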

Point Estimation: Frequency Table

  • Putting this into a table:

    Pony type   n (%)
    Alicorn     1 (12.5%)
    Earth       2 (25.0%)
    Pegasus     3 (37.5%)
    Unicorn     2 (25.0%)

Point Estimation: Contingency Table

  • Contingency table: A table that summarizes two qualitative variables and their overlap.

  • We will not concern ourselves with the derivation, but will rely on R.

  • Consider this data,

Point Estimation: Contingency Table

  • The resulting contingency table would look something like this:
    • We are using column totals as our denominators.
# A tibble: 4 × 3
  pony_type No        Yes      
  <chr>     <chr>     <chr>    
1 Alicorn   0 (0.0%)  1 (25.0%)
2 Earth     2 (50.0%) 0 (0.0%) 
3 Pegasus   0 (0.0%)  3 (75.0%)
4 Unicorn   2 (50.0%) 0 (0.0%) 
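
  • A sketch of how counts and column percentages like these could be computed in base R. This is not the code that produced the tibble above. It assumes the No/Yes columns record whether the pony has wings and that the data are the 8 ponies from the frequency table example; the names pony_type and has_wings are hypothetical:

```r
# ASSUMED data: the 8 ponies from the frequency table example
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

# ASSUMED second variable: Pegasus and Alicorn ponies have wings
has_wings <- ifelse(pony_type %in% c("Pegasus", "Alicorn"), "Yes", "No")

# Counts for every combination of the two variables
counts <- table(pony_type, has_wings)
counts

# Column percentages: each count divided by its column total
col_pct <- 100 * prop.table(counts, margin = 2)
col_pct
```

  • Using margin = 1 instead would give row percentages, and leaving margin out gives overall percentages.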

Graphs: Box Plots

  • Box plots display the distribution of a continuous variable using the five-number summary:

    • Whisker: Minimum
    • Beginning of box: 25th percentile (first quartile; Q1, P25)
    • “Middle” of box: Median (50th percentile, second quartile; Q2, P50)
    • End of box: 75th percentile (third quartile; Q3, P75)
    • Whisker: Maximum
  • We use box and whisker plots to get an idea of the spread and skewness of the data.

  • Note: there are different ways to define the whiskers.

    • I use the min/max as whiskers when sketching by hand.
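    • ggplot() (through geom_boxplot()) extends each whisker to the most extreme value within 1.5 \times IQR of the box; values beyond that are drawn as individual points.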
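
  • To see how the whisker definition changes the picture, here is a sketch using base R (not ggplot()) on the flying speed example. It relies on fivenum() and boxplot.stats(), whose default coef is 1.5; the vector name speeds is our own:

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# Whiskers at the min/max: the five-number summary
fivenum(speeds)  # 10 20 30 40 100

# Whiskers limited to 1.5 x (box width): extreme values become outliers
bp <- boxplot.stats(speeds, coef = 1.5)
bp$stats  # 10 20 30 40 40 (upper whisker stops at 40)
bp$out    # 100 is flagged as an outlier
```

  • Note that the box edges here (20 and 40) come from R's own quartile convention, which can differ from quartiles found by hand.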

Graphs: Box Plots

  • Describe this box plot:

Graphs: Histograms

  • Histograms show the distribution of a continuous variable.

    • What is the shape of the distribution?
    • Is the distribution symmetric? Skewed? How skewed?
  • Values are grouped into intervals (“bins”); the height of each bar shows how many values fall into that interval.

  • This allows us to quickly see if there are any oddities.

    • Increased proportion of a specific value/bin.
      • Zero inflation? Value used to indicate missing?
    • Any values that are “out in the tail”.
      • Outlier? Data entry error?
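
  • A sketch of the binning step behind a histogram, using base R's hist() with plot = FALSE so that only the counts are returned. The bin width of 25 km/hr is our own choice for illustration:

```r
# Flying speeds (km/hr) for the 5 ponies in the example
speeds <- c(10, 20, 30, 40, 100)

# Group the values into bins of width 25 without drawing anything
h <- hist(speeds, breaks = seq(0, 100, by = 25), plot = FALSE)

h$breaks  # bin edges: 0 25 50 75 100
h$counts  # values per bin: 2 2 0 1
```

  • The empty third bin followed by a single value in the last bin is the kind of "out in the tail" pattern we would want to investigate.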

Graphs: Histograms

  • Describe the histogram:

Graphs: Bar Graphs

  • Bar graphs display the distribution of categorical data.
    • The frequency or proportion of observations is displayed on the bar graph.
  • Bar graphs usually have categories on the x-axis and counts or proportions on the y-axis.
    • Note that we could flip the axes to create a horizontal bar graph.
  • Note that the bars are separated on the x-axis to indicate the lack of continuity.
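
  • A minimal sketch of a bar graph in base R graphics (used here to keep the example self-contained; it is not ggplot() code). It reuses the 8-pony sample from the frequency table example, and the names pony_type, counts, and bar_mids are our own:

```r
# Pony types for the 8 ponies in the frequency table example
pony_type <- c("Earth", "Pegasus", "Unicorn", "Earth",
               "Pegasus", "Pegasus", "Unicorn", "Alicorn")

# Bar heights are the category counts
counts <- table(pony_type)

# Categories on the x-axis, counts on the y-axis;
# barplot() leaves gaps between bars by default
bar_mids <- barplot(counts,
                    xlab = "Pony type",
                    ylab = "Number of ponies")
```

  • Passing prop.table(counts) instead of counts would put proportions on the y-axis.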

Graphs: Bar Graphs

  • Consider the bar graph, below.

Graphs: Side-by-Side Bar Graphs

  • Consider the bar graph, below.

Graphs: Stacked Bar Graphs

  • Consider the bar graph, below.

Graphs: Histograms vs Bar Graphs

  • We have now reviewed two “bar style” graphs that we see regularly: histograms and bar graphs.

  • We use histograms to see the distribution of continuous variables.

    • The x-axis represents numeric intervals.
    • The bars touch each other to represent continuity.
  • We use bar graphs to see the distribution of categorical variables.

    • The x-axis represents categories.
    • The bars do not touch each other, implying distinct categories.

Graphs: Scatterplots

  • Scatterplots allow us to look at the relationship between two continuous variables.
    • Each point on the graph represents one observation.
  • What statisticians use scatterplots for:
    • Explore patterns (aka trends or relationships).
      • Linear relationships.
      • Non-linear relationships.
    • Detect clusters of observations.
    • Find oddities in the data (outliers).
  • When we describe the relationship, we are really answering the question, “As x increases, what happens to y?”

Graphs: Scatterplots

  • Consider the scatterplot, below.

Wrap Up

  • We have covered (“reminded” ourselves of) a lot today!
    • Always remember that I do not expect you to:
      • Memorize code.
      • Produce code in a timed environment.
      • Automatically know how to do these things.
    • I do expect you to:
      • Use your resources (lecture slides, GitHub website, Discord).
      • Try your best.

Wrap Up

  • Today’s lecture:
    • Basic summarization of data.
    • Basic data visualization.
  • This week’s lab:
    • Summarizing data
    • Visualizing data
  • Next week:
    • Review of statistical inference.
    • Confidence intervals and hypothesis tests.
      • One sample means.
      • Two sample means.
        • Independent data.
        • Dependent data.

Wrap Up

  • Daily activity: the .qmd we worked on during class.
    • Due date: Monday, June 23, 2025.
  • You will upload the resulting .html file on Canvas.
    • Please refer to the help guide on the Biostat website if you need help with submission.